First, we load our modules.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Then, we load our dataset and obtain some basic information about it.
df = pd.read_csv("midterm_df1.csv")
df.info()
df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1019 entries, 0 to 1018
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   x1      1019 non-null   object
 1   x2      1019 non-null   float64
 2   x3      1019 non-null   float64
 3   x4      1019 non-null   float64
 4   x5      1019 non-null   object
 5   x6      1019 non-null   float64
 6   x7      1019 non-null   float64
 7   x8      1019 non-null   float64
 8   y       1019 non-null   float64
dtypes: float64(7), object(2)
memory usage: 71.8+ KB
| | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | D | 54.00 | 1040.0 | 676.0 | low | 0.0 | 2.5 | 162.0 | 79.986111 |
| 1 | D | 54.00 | 1055.0 | 676.0 | low | 0.0 | 2.5 | 162.0 | 61.887366 |
| 2 | J | 33.25 | 932.0 | 594.0 | low | 142.5 | 0.0 | 228.0 | 40.269535 |
| 3 | K | 33.25 | 932.0 | 594.0 | low | 142.5 | 0.0 | 228.0 | 41.052780 |
| 4 | F | 26.60 | 932.0 | 670.0 | low | 114.0 | 0.0 | 228.0 | 47.029847 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1014 | D | 27.64 | 870.1 | 768.3 | low | 116.0 | 8.9 | 179.6 | 44.284354 |
| 1015 | D | 32.22 | 817.9 | 813.4 | low | 0.0 | 10.4 | 196.0 | 31.178794 |
| 1016 | D | 14.85 | 892.4 | 780.0 | low | 139.4 | 6.1 | 192.7 | 23.696601 |
| 1017 | D | 15.91 | 989.6 | 788.9 | low | 186.7 | 11.3 | 175.6 | 32.768036 |
| 1018 | D | 26.09 | 864.5 | 761.5 | low | 100.5 | 8.6 | 200.6 | 32.401235 |
1019 rows × 9 columns
Above, we see the head and tail of the data set. There are 9 variables: 8 inputs (x1 to x8) and 1 output (y). The two inputs stored as object, x1 and x5, are categorical; the remaining inputs and y are stored as float64 and look continuous. This data set has 1019 observations.
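The split into categorical and continuous columns can also be read off the dtypes programmatically. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, loosely copying the first rows shown above); it assumes that every non-numeric column should be treated as categorical.

```python
import pandas as pd

# Hypothetical toy frame mimicking the mix of dtypes reported by df.info().
toy = pd.DataFrame({
    "x1": ["D", "J", "K"],
    "x2": [54.0, 33.25, 33.25],
    "x5": ["low", "low", "low"],
    "y": [79.99, 40.27, 41.05],
})

# Columns stored as numbers are candidates for continuous variables;
# everything else is treated as categorical in this sketch.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

A numeric dtype does not guarantee that a variable is continuous, so the unique-value counts are still worth checking.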
df.nunique()
x1     11
x2    280
x3    284
x4    304
x5      2
x6    187
x7    155
x8    205
y     930
dtype: int64
The above displays the number of unique values for each variable. There are 11 unique categories for x1 and two unique categories for x5.
df.isna().sum()
x1    0
x2    0
x3    0
x4    0
x5    0
x6    0
x7    0
x8    0
y     0
dtype: int64
There are no missing values in this data set.
In the section below, we explore counts of the categorical variables, x1 and x5.
x1
Below is the number of observations in each category of x1, which, as we saw previously, has 11 unique values.
df.groupby(['x1']).size()
x1
A    134
B    126
C     62
D    425
E     91
F     54
G     22
H     52
I     26
J     13
K     14
dtype: int64
fig, ax = plt.subplots(figsize=(12,8))
mycount = sns.countplot(data = df,
                        order = df['x1'].value_counts().index,
                        x='x1',
                        ax=ax)
# Label each bar with its count, centered just above the bar.
for p in mycount.patches:
    height = p.get_height()
    at_midpoint = p.get_x() + p.get_width()/2.
    mycount.text(at_midpoint, height + 0.25,
                 height,
                 ha='center')
plt.show()
The largest share of the dataset (425 of 1019 observations, about 42%) is associated with x1 = "D", followed by "A", "B", and "E".
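To make the relative sizes of the categories concrete, the counts can be converted to proportions. This is a small sketch that re-enters the counts printed by `df.groupby(['x1']).size()` above rather than reading the CSV; the names `counts` and `shares` are only illustrative.

```python
import pandas as pd

# Counts copied from the df.groupby(['x1']).size() output above.
counts = pd.Series(
    {"A": 134, "B": 126, "C": 62, "D": 425, "E": 91, "F": 54,
     "G": 22, "H": 52, "I": 26, "J": 13, "K": 14},
    name="n",
)

# Share of all observations in each category, largest first.
shares = (counts / counts.sum()).sort_values(ascending=False)
print(shares.round(3).head(4))
```

On the real data frame, `df['x1'].value_counts(normalize=True)` should give the same proportions directly.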
x5
Below is the number of observations in each category of x5.
df.groupby(['x5']).size()
x5
high    247
low     772
dtype: int64
There are two categories of x5, "low" and "high", and a majority of the observations in this dataset (772 of 1019, about 76%) are associated with "low".
x1 and x5
Below, we look at the number of observations in all possible combinations of these two categorical variables. With 11 categories of x1 and 2 categories of x5, this could result in up to 22 combinations.
pd.crosstab(df['x1'], df['x5'])
| x1 | x5 = high | x5 = low |
|---|---|---|
| A | 36 | 98 |
| B | 8 | 118 |
| C | 28 | 34 |
| D | 111 | 314 |
| E | 36 | 55 |
| F | 0 | 54 |
| G | 0 | 22 |
| H | 28 | 24 |
| I | 0 | 26 |
| J | 0 | 13 |
| K | 0 | 14 |
Most combinations of x1 and x5 (17 of 22) occur in the data. The exceptions are x5 = "high" with x1 = "F", "G", "I", "J", or "K", which have no observations.
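The number of possible and observed combinations can be checked directly from the cross-tabulation. The sketch below re-enters the cell counts from the table above instead of loading the file; `ct`, `possible`, `observed`, and `empty_x1` are illustrative names.

```python
import pandas as pd

# Cell counts copied from the pd.crosstab(df['x1'], df['x5']) output above.
ct = pd.DataFrame(
    {"high": [36, 8, 28, 111, 36, 0, 0, 28, 0, 0, 0],
     "low": [98, 118, 34, 314, 55, 54, 22, 24, 26, 13, 14]},
    index=pd.Index(list("ABCDEFGHIJK"), name="x1"),
)

possible = ct.size                      # 11 levels of x1 times 2 levels of x5
observed = int((ct > 0).sum().sum())    # cells with at least one observation
empty_x1 = ct.index[ct["high"] == 0].tolist()

print(possible, observed, empty_x1)
```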
sns.catplot(data = df,
y='x1',
col='x5',
kind='count')
plt.show()
Here, we see that across both values of x5, x1 = "D" has the greatest number of observations. The number of observations for x1 = "A" and "E" show a similar pattern. Lastly, even though there are more observations for x5 = "low", there seem to be relatively similar numbers of observations for x1 = "C" and "H" across the two x5 groups.
Below, we explore the distributions and relations between the continuous variables.
sns.pairplot(df)
plt.show()
This plot is a lot to digest. I'll do my best to talk about some important things that stood out to me.
Along the diagonal, we see the distributions of the continuous variables. A few variables appear to have a single peak near the center of their distributions, namely x3, x4, x8, and y. Two other variables, x6 and x7, have two peaks: one at values close to or equal to 0 and another toward the center of their distributions.
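A peak at 0 in a histogram often comes from many rows that are exactly 0, which is easy to quantify. The following is a minimal sketch on hypothetical toy data (the column `x6_like` and its values are invented to resemble the shape described above); on the real data the same comparison could be applied to x6 and x7.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a column that is exactly 0 for part of the rows and
# spread around a positive center otherwise, similar in shape to x6 and x7.
rng = np.random.default_rng(0)
values = rng.normal(loc=100.0, scale=20.0, size=200)
values[:80] = 0.0
toy = pd.DataFrame({"x6_like": values})

# Fraction of rows where the value is exactly zero.
zero_share = (toy["x6_like"] == 0).mean()
print(f"share of exact zeros: {zero_share:.2f}")
```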
The rest of the plot shows the relations between the continuous variables. The majority of these plots look like clouds, i.e., there's no linear relation between the variables. However, there are a few exceptions:
- x2 and y appear to be positively correlated
- x8 appears to be negatively correlated with x3 and x4
- x7 and x8 seem to be negatively correlated

We can look at a correlation plot to verify these visual trends.
fig, ax = plt.subplots(figsize = (9,9))
df_numeric = df.select_dtypes(include=np.number)
# Correlate only the numeric columns; x1 and x5 hold strings.
sns.heatmap(data = df_numeric.corr(),
            vmin = -1, vmax = 1, center = 0,
            cmap = "coolwarm",
            annot = True,
            annot_kws = {"size": 14}
            )
plt.show()
As stated previously and as shown in the above correlation plot, x2 and y are positively correlated, and x8 is negatively correlated with x4 and x7.
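Besides reading the heatmap, the strongest pairwise correlations can be listed in sorted order. This is a sketch on hypothetical toy data (columns `a` to `d` are invented, with `b` built to track `a` closely); it keeps only the upper triangle of the correlation matrix so that each pair appears once.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with known relations between the columns.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=300),   # strongly positive with a
    "c": -a + rng.normal(scale=1.0, size=300),      # moderately negative with a
    "d": rng.normal(size=300),                      # unrelated noise
})

corr = toy.corr()
# Keep the strict upper triangle so each pair of variables is listed once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()
# Sort by absolute value so strong negative correlations rank high as well.
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs.round(2))
```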
y
Next, we look at the relation between the categorical inputs, x1 and x5, and the output y. None of the continuous inputs are considered yet.
x1 and y
fig, ax = plt.subplots(figsize = (12, 8))
sns.violinplot(data = df,
x='x1',
y='y',
inner=None,
ax=ax)
sns.swarmplot(data = df,
x='x1',
y='y',
color='k',
ax=ax,
size = 3)
plt.show()
Ranges of y vary by x1. For instance, for x1 = "D", y ranges from 0 to above 80, whereas for x1 = "G", y ranges from around 50 to above 80. Also, the distributions and average values of y differ based on categories of x1. For instance, there seem to be two peaks or modes of y for x1 = "G", and the mean of y in "G" is higher than the mean of y in "H". With the exception of x1 = "K" and "E" (and "G"), all distributions of y seem to have one peak. Values of y for x1 = "E" seem to be relatively evenly distributed across the range of about 10 to 90.
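The ranges and means read off the violin plot can be tabulated per category. Below is a minimal sketch on hypothetical toy values (the numbers are invented and only loosely echo the ranges described above); on the real data the same `groupby` call would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use df[['x1', 'y']].
toy = pd.DataFrame({
    "x1": ["D", "D", "D", "G", "G", "H", "H"],
    "y":  [5.0, 40.0, 81.0, 55.0, 83.0, 20.0, 30.0],
})

# One row per category with its size, range, and mean of y.
summary = toy.groupby("x1")["y"].agg(["count", "min", "mean", "max"])
print(summary)
```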
x5 and y
fig, ax = plt.subplots(figsize = (12, 8))
sns.violinplot(data = df,
x='x5',
y='y',
inner=None,
ax=ax)
sns.swarmplot(data = df,
x='x5',
y='y',
color='k',
ax=ax,
size = 5)
plt.show()
In contrast to the previous plot, there doesn't seem to be a large difference in the distribution of y between the groups of x5. The range, average value, and variability of y seem to be similar across the "low" and "high" groups.
y based on categorical inputs
In this last section, we examine whether there are differences in the associations between the continuous inputs and the output based on x1 and x5.
x1
This variable has 11 categories, which makes it hard to visualize. I think it might be useful to lump some of the categories together. Below, I show a conditional pairs plot just to demonstrate how hard it is to see if the relations between the variables differ by x1.
sns.pairplot(data = df,
hue='x1',
diag_kws={'common_norm': False})
plt.show()
The scatterplot between y and x2 makes me think that this is the caret dataset.
Anyway, below, I create a new variable based on x1. Specifically, I want to compare the group that is most represented in x1 ("D") to all other groups.
df['x1_binary'] = np.where(df['x1']=='D', 'D', 'All Other')
df.head()
| | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | y | x1_binary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | D | 54.00 | 1040.0 | 676.0 | low | 0.0 | 2.5 | 162.0 | 79.986111 | D |
| 1 | D | 54.00 | 1055.0 | 676.0 | low | 0.0 | 2.5 | 162.0 | 61.887366 | D |
| 2 | J | 33.25 | 932.0 | 594.0 | low | 142.5 | 0.0 | 228.0 | 40.269535 | All Other |
| 3 | K | 33.25 | 932.0 | 594.0 | low | 142.5 | 0.0 | 228.0 | 41.052780 | All Other |
| 4 | F | 26.60 | 932.0 | 670.0 | low | 114.0 | 0.0 | 228.0 | 47.029847 | All Other |
Now, we can examine whether the relations between the continuous inputs and y differ based on whether x1 = "D" or "All Other".
sns.pairplot(data = df,
hue='x1_binary',
diag_kws={'common_norm': False})
plt.show()
Examining just the relations between the continuous inputs, we see that these relations do not differ based on whether x1 = "D" or "All Other". Further, looking at just the bottom row or right column, we see that the relations between y and the continuous inputs do not differ based on this grouping of x1. It is possible that some other grouping makes more sense, but I don't know what these variables represent, so it is hard (for me) to decide how to group these.
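One alternative grouping that does not require knowing what the variables mean is to lump only the rare categories together. This is a sketch of that idea on a hypothetical toy column, not the grouping used in this notebook; the threshold `min_count` is an arbitrary illustrative choice.

```python
import pandas as pd

# Hypothetical toy column; min_count is an arbitrary illustrative threshold.
x1 = pd.Series(list("DDDDDAAABBC"), name="x1")
min_count = 3

counts = x1.value_counts()
keep = counts[counts >= min_count].index

# Keep frequent categories as they are and replace the rest with "Other".
x1_lumped = x1.where(x1.isin(keep), "Other")

print(x1_lumped.value_counts())
```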
After some thinking though, I decided to return to the first conditional pairs plot and look again.
x1 = "D" has the most observations and appears "all over" the plots showing the relations between the inputs and y, so it is hard to see what is happening with the other categories of x1.
It seems that for the relation between x2 and y, when x1 = "D", the two aforementioned variables are positively related, just like the overall trend suggests. However, there appear to be clusters of observations corresponding to the other values of x1. This is shown and discussed below.
sns.lmplot(data = df,
x = "x2",
y = "y",
hue = "x1",
col = "x1")
plt.show()
I forget (or maybe we didn't go over it, but I cannot find it!) how to change the grid arrangement of these plots when there are a lot of groups.
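For reference, seaborn's `lmplot` accepts a `col_wrap` argument that wraps the column facets onto several rows. The sketch below uses hypothetical toy data with six invented groups (column `g`) in place of x1; the values `col_wrap=3` and `height=2.5` are illustrative choices, not settings taken from this notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data with six groups standing in for the levels of x1.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x2": rng.uniform(10, 55, size=120),
    "g": np.repeat(list("ABCDEF"), 20),
})
toy["y"] = 1.5 * toy["x2"] + rng.normal(scale=5.0, size=120)

# col_wrap=3 starts a new row of panels after every third group.
grid = sns.lmplot(data=toy, x="x2", y="y", hue="g", col="g",
                  col_wrap=3, height=2.5)
print(len(grid.axes.flat))
```

With 11 categories, a value such as `col_wrap=4` would give three rows of panels, which should be easier to read than a single row of 11.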
I know the plots will appear pretty small in the HTML file, so from left to right, here are the groups we are seeing: D (red), J (orange), K (mustard yellow), F (olive green), I (green), A (green/blue), B (turquoise), E (blue), G (purple), C (magenta), H (pink).
Anyway, here is a summary of the trends based on x1:
- For x1 = "D", "J", "F", "I", "A", "B", "E", "C", and "H", there is a positive relation between x2 and y.
- The ranges of x2 and y differ by x1; for instance, observations corresponding to x1 = "G" occupy the top right part of the scatter plot, meaning these are relatively large values of y and x2. In contrast, x1 = "A" takes on a wide range of values of x2 but only takes on values of y that are below 40.
- For x1 = "K", there seems to be no relation between x2 and y.
- For x1 = "G", there seems to be a negative relation.

x5
sns.pairplot(data = df,
hue='x5',
diag_kws={'common_norm': False})
plt.show()
Overall, it does not look like the relations between the continuous inputs and y differ based on x5.
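A visual impression like this can be cross-checked by computing a correlation separately within each group. The snippet below is a minimal sketch on hypothetical toy data in which both groups are generated from the same linear relation; the column names mirror x2, x5, and y, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "high" and "low" share the same x2-y relation.
rng = np.random.default_rng(0)
x2 = rng.uniform(10, 55, size=200)
toy = pd.DataFrame({
    "x2": x2,
    "x5": np.where(np.arange(200) % 4 == 0, "high", "low"),
    "y": 1.5 * x2 + rng.normal(scale=5.0, size=200),
})

# Pearson correlation between x2 and y within each level of x5.
per_group = pd.Series(
    {level: grp["x2"].corr(grp["y"]) for level, grp in toy.groupby("x5")},
    name="corr(x2, y)",
)
print(per_group.round(2))
```

The same loop with `toy.groupby("x1")` would give one correlation per category of x1, which could back up the per-category trends described for x2 and y.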